AITopics | speech synthesizer

speech synthesizer

Voice Conversion with Diverse Intonation using Conditional Variational Auto-Encoder

Suh, Soobin, Ahn, Dabi, Park, Heewoong, Park, Jonghun

arXiv.org Artificial Intelligence | Apr-17-2025

Voice conversion is the task of synthesizing an utterance in a target speaker's voice while maintaining the linguistic information of the source utterance. While a speaker can produce varying utterances from a single script with different intonations, conventional voice conversion models were limited to producing only one result per source input. To overcome this limitation, we propose a novel approach for voice conversion with diverse intonations using a conditional variational autoencoder (CVAE). Experiments have shown that the speaker's style feature can be mapped into a latent space with a Gaussian distribution. We have also been able to convert voices with more diverse intonation by making the posterior of the latent space more complex with inverse autoregressive flow (IAF). As a result, the converted voice not only has a diversity of intonations, but also has better sound quality than the model without CVAE.
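The two sampling ingredients named in the abstract can be sketched in a few lines: a reparameterized draw from a diagonal Gaussian, followed by one inverse autoregressive flow (IAF) step that reshapes the sample while tracking its log-density. This is a toy illustration only, not the authors' model: the function names are hypothetical, and the strictly lower-triangular weight matrices `w_shift` and `w_scale` are an assumed stand-in for the autoregressive network an actual IAF would use.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps and return (z, log q(z)) for a diagonal Gaussian."""
    z, log_q = [], 0.0
    for m, lv in zip(mu, log_var):
        eps = rng.gauss(0.0, 1.0)
        z.append(m + math.exp(0.5 * lv) * eps)
        log_q += -0.5 * (math.log(2.0 * math.pi) + lv + eps * eps)
    return z, log_q


def iaf_step(z, log_q, w_shift, w_scale):
    """Apply one IAF step with a toy linear autoregressive conditioner.

    The shift and log-scale for dimension i depend only on z[0..i-1], so the
    Jacobian is lower triangular and its log-determinant is the sum of the
    log-scales. Only entries below the diagonal of the weight matrices are read.
    """
    new_z = []
    for i in range(len(z)):
        shift = sum(w_shift[i][j] * z[j] for j in range(i))
        log_scale = sum(w_scale[i][j] * z[j] for j in range(i))
        new_z.append(math.exp(log_scale) * z[i] + shift)
        log_q -= log_scale  # change of variables: subtract log|det J|
    return new_z, log_q


# Example: a 3-dimensional latent with hypothetical encoder outputs.
rng = random.Random(0)
z0, log_q0 = reparameterize([0.0, 0.5, -0.5], [0.0, -1.0, -1.0], rng)
w_shift = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.1, -0.2, 0.0]]
w_scale = [[0.0, 0.0, 0.0], [0.3, 0.0, 0.0], [-0.1, 0.2, 0.0]]
z1, log_q1 = iaf_step(z0, log_q0, w_shift, w_scale)
print(z1, log_q1)
```

Stacking several such steps gives a posterior that is no longer a plain Gaussian, which is the sense in which the abstract describes making the posterior "more complex".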

artificial intelligence, machine learning, utterance, (15 more...)

arXiv:2504.12005

Country:

North America > United States (0.14)
Asia (0.14)

Genre: Research Report > Promising Solution (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation

Yu, Wenyi, Wang, Siyin, Yang, Xiaoyu, Chen, Xianzhao, Tian, Xiaohai, Zhang, Jun, Sun, Guangzhi, Lu, Lu, Wang, Yuxuan, Zhang, Chao

arXiv.org Artificial Intelligence | Nov-27-2024

Full-duplex multimodal large language models (LLMs) provide a unified framework for addressing diverse speech understanding and generation tasks, enabling more natural and seamless human-machine conversations. Unlike traditional modularised conversational AI systems, which separate speech recognition, understanding, and text-to-speech generation into distinct components, multimodal LLMs operate as single end-to-end models. This streamlined design eliminates error propagation across components and fully leverages the rich non-verbal information embedded in input speech signals. We introduce SALMONN-omni, a codec-free, full-duplex speech understanding and generation model capable of simultaneously listening to its own generated speech and background sounds while speaking. To support this capability, we propose a novel duplex spoken dialogue framework incorporating a "thinking" mechanism that facilitates asynchronous text and speech generation relying on embeddings instead of codecs (quantized speech and audio tokens). Experimental results demonstrate SALMONN-omni's versatility across a broad range of streaming speech tasks, including speech recognition, speech enhancement, and spoken question answering. Additionally, SALMONN-omni excels at managing turn-taking, barge-in, and echo cancellation scenarios, establishing its potential as a robust prototype for full-duplex conversational AI systems. To the best of our knowledge, SALMONN-omni is the first codec-free model of its kind. A full technical report along with model checkpoints will be released soon.

arxiv preprint arxiv, salmonn-omni, speech, (13 more...)

arXiv:2411.18138

Country:

Asia > South Korea > Incheon > Incheon (0.05)
Asia > South Korea > Seoul > Seoul (0.05)
Asia > Singapore (0.05)
(4 more...)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Reviews: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Neural Information Processing Systems | Oct-7-2024, 11:48:38 GMT

This work offers a clearly defined extension to TTS systems that allows building good-quality voices (even ones unseen during training of either component) from a few adaptation data points. The authors do not seem to offer any truly new theoretical extension to the "building blocks" of their system, which is based on known components proposed elsewhere (the speaker encoder, synthesizer and vocoder are based on previously published models). However, their combination is clever and well-engineered, and it allows the building blocks to be independently estimated in either unsupervised (speaker encoder, where audio transcripts are not needed) or supervised (speech synthesizer) ways, on different corpora. This allows for greater flexibility while reducing the requirement for large amounts of transcribed data for each of the components (i.e. ...

Good points:
- clear, fair and convincing experiments
- trained and evaluated on public corpora, which greatly increases reproducibility (a portion of the experiments is carried out on proprietary data, but all have equivalent experiments constrained to publicly available data)

Weak points:
- it would probably make sense to investigate the additional adaptability when more data per speaker is available; it seems your system cannot easily leverage more than 10s of reference speech data

Summary: this is a very good study on generating multi-speaker TTS systems from small amounts of target speaker data.

multispeaker text-to-speech synthesis, speaker verification, transfer learning, (9 more...)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)

HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis

Lee, Sang-Hoon, Choi, Ha-Yeong, Kim, Seung-Bin, Lee, Seong-Whan

arXiv.org Artificial Intelligence | Nov-27-2023

Large language model (LLM)-based speech synthesis has been widely adopted in zero-shot speech synthesis. However, such models require large-scale data and possess the same limitations as previous autoregressive speech models, including slow inference speed and lack of robustness. This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC). We verified that hierarchical speech synthesis frameworks could significantly improve the robustness and expressiveness of the synthetic speech. Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot speech synthesis scenarios. For text-to-speech, we adopt the text-to-vec framework, which generates a self-supervised speech representation and an F0 representation based on text representations and prosody prompts. Then, HierSpeech++ generates speech from the generated vector, F0, and voice prompt. We further introduce a highly efficient speech super-resolution framework from 16 kHz to 48 kHz. The experimental results demonstrated that the hierarchical variational autoencoder could be a strong zero-shot speech synthesizer given that it outperforms LLM-based and diffusion-based models. Moreover, we achieved the first human-level quality zero-shot speech synthesis. Audio samples and source code are available at https://github.com/sh-lee-prml/HierSpeechpp.

representation, speech, synthesis, (13 more...)

arXiv:2311.12454

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > South Korea > Seoul > Seoul (0.04)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Wang, Chengyi, Chen, Sanyuan, Wu, Yu, Zhang, Ziqiang, Zhou, Long, Liu, Shujie, Chen, Zhuo, Liu, Yanqing, Wang, Huaming, Li, Jinyu, He, Lei, Zhao, Sheng, Wei, Furu

arXiv.org Artificial Intelligence | Jan-5-2023

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. In-context learning capabilities emerge in Vall-E, and it can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find Vall-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.

artificial intelligence, natural language, neural codec language model, (3 more...)

arXiv:2301.02111

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Disabled lawmaker first in Japan to use speech synthesizer during Diet session

The Japan Times | Nov-7-2019, 15:40:51 GMT

A lawmaker with severe physical disabilities attended his first parliamentary interpellation Thursday since being elected in July and became the first lawmaker in Japan ever to use an electronically generated voice during a Diet session. In the session of the education, culture and science committee, Yasuhiko Funago, who has amyotrophic lateral sclerosis, a condition also known as Lou Gehrig's disease, greeted the committee using a speech synthesizer. He also asked questions through a proxy speaker. "As a newcomer, I am still inexperienced, but with everyone's assistance, I will do my best to tackle (issues)," he said at the beginning of the session. An aide then posed questions on his behalf and expressed his desire to see improvements in the learning environment for disabled children.

diet session, funago, speech synthesizer, (6 more...)

Country: Asia > Japan (0.64)

Industry:

Health & Medicine > Therapeutic Area > Rheumatology (0.94)
Health & Medicine > Therapeutic Area > Pulmonary/Respiratory Diseases (0.94)
Health & Medicine > Therapeutic Area > Neurology > Amyotrophic Lateral Sclerosis (ALS) (0.94)
Health & Medicine > Therapeutic Area > Musculoskeletal (0.94)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.64)

A Fully Time-domain Neural Model for Subband-based Speech Synthesizer

Rabiee, Azam, Kim, Geonmin, Kim, Tae-Ho, Lee, Soo-Young

arXiv.org Artificial Intelligence | Jul-1-2019

This paper introduces a deep neural network model for a subband-based speech synthesizer. The model benefits from the narrow bandwidth of the subband signals to reduce the complexity of the time-domain speech generator. We employed multi-level wavelet analysis/synthesis to decompose/reconstruct the signal into subbands in the time domain. Inspired by WaveNet, a convolutional neural network (CNN) model predicts subband speech signals fully in the time domain. Due to the narrow bandwidth of the subbands, a simple network architecture is enough to learn the simple patterns of the subbands accurately. In the ground truth experiments with teacher-forcing, the subband synthesizer outperforms the fullband model significantly in terms of both subjective and objective measures. In addition, by conditioning the model on the phoneme sequence using a pronunciation dictionary, we have achieved a fully time-domain neural model for a subband-based text-to-speech (TTS) synthesizer, which is nearly end-to-end. The generated speech of the subband TTS shows quality comparable to the fullband one with a lighter network architecture for each subband.
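The multi-level wavelet analysis/synthesis step mentioned in the abstract can be illustrated with a small sketch. The abstract does not say which wavelet is used, so the Haar wavelet here is an assumption chosen for brevity, and the function names are hypothetical. The sketch only shows that a signal can be split into time-domain subbands and rebuilt exactly; it does not include the neural predictor.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_analysis(signal):
    """One analysis level: return (low band, high band), each half the input length."""
    low = [(signal[i] + signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    return low, high


def haar_synthesis(low, high):
    """Inverse of haar_analysis: interleave the reconstructed sample pairs."""
    out = []
    for a, d in zip(low, high):
        out.append((a + d) * _S)
        out.append((a - d) * _S)
    return out


def decompose(signal, levels):
    """Multi-level analysis: repeatedly split the low band.

    Returns [approximation, detail_L, ..., detail_1], coarsest band first.
    """
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    low, details = list(signal), []
    for _ in range(levels):
        low, high = haar_analysis(low)
        details.append(high)
    return [low] + details[::-1]


def reconstruct(subbands):
    """Multi-level synthesis: merge subbands back into the full-band signal."""
    low = subbands[0]
    for high in subbands[1:]:
        low = haar_synthesis(low, high)
    return low


# Example: two-level decomposition of an 8-sample signal.
samples = [1.0, 3.0, -2.0, 0.5, 4.0, 4.0, -1.0, 2.0]
bands = decompose(samples, 2)
print([len(b) for b in bands])
print(reconstruct(bands))
```

Each subband is shorter than the original signal, which is the property the abstract relies on when it says a simple network per subband is sufficient.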

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv:1810.05319

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > South Korea > Daejeon > Daejeon (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Brain implants, AI, and a speech synthesizer have turned brain activity into robot words

#artificialintelligence | Feb-1-2019, 23:52:24 GMT

Neural networks have been used to turn words that a human has heard into intelligible, recognizable speech. It could be a step toward technology that can one day decode people's thoughts. A challenge: Thanks to fMRI scanning, we've known for decades that when people speak, or hear others, it activates specific parts of their brain. However, it's proved hugely challenging to translate thoughts into words. A team from Columbia University has developed a system that combines deep learning with a speech synthesizer to do just that.

artificial intelligence, brain activity, machine learning, (5 more...)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.44)

New AI Mimics Any Voice in a Matter of Minutes

#artificialintelligence | May-24-2017, 15:52:25 GMT

The story starts out like a bad joke: Obama, Clinton and Trump walk into a bar, where they applaud a new startup based in Montreal, Canada called Lyrebird. If the scenario seems too bizarre to be real, you're right--it's not. The entire recording was generated by a new AI with the ability to mimic natural conversation, at a rate much faster than any previous speech synthesizer. From there, it adds an extra layer of emotion or special intonation, until it nails a person's voice, tone and accent--be it Obama, Trump or even you. While Lyrebird still retains a slight but noticeable robotic buzz characteristic of machine-generated speech, add some smartly placed background noise to cover up the distortion, and the recordings could pass as genuine to unsuspecting ears.

artificial intelligence, lyrebird, machine learning, (15 more...)

Country: North America > Canada > Quebec > Montreal (0.26)

Industry:

Information Technology > Security & Privacy (0.48)
Media (0.31)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.37)

Glove-TalkII: Mapping Hand Gestures to Speech Using Neural Networks

Fels, Sidney, Hinton, Geoffrey E.

Neural Information Processing Systems | Dec-31-1995

Glove-TalkII is a system which translates hand gestures to speech through an adaptive interface. Hand gestures are mapped continuously to 10 control parameters of a parallel formant speech synthesizer. The mapping allows the hand to act as an artificial vocal tract that produces speech in real time. This gives an unlimited vocabulary in addition to direct control of fundamental frequency and volume. Currently, the best version of Glove-TalkII uses several input devices (including a CyberGlove, a ContactGlove, a 3-space tracker, and a foot-pedal), a parallel formant speech synthesizer and 3 neural networks.

configuration, mapping, speech, (17 more...)

Country:

North America > Canada > Ontario > Toronto (0.30)
Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.05)
North America > United States > New York (0.04)

Technology:

Information Technology > Artificial Intelligence > Vision > Gesture Recognition (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.73)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.58)
